Introduce Table wrapper, unify table registration via register_table; deprecate legacy APIs #1243
docs/tests, add DataFrame view support, and improve Send/concurrency support. Migrates the codebase from using `Table` to a `TableProvider`-based API, refactors registration and access paths to simplify catalog/context interactions, and updates documentation and examples. DataFrame view handling is improved (`into_view` is now public), the test suite is expanded to cover new registration and async SQL scenarios, and `TableProvider` now supports the `Send` trait across modules for safer concurrency. Minor import cleanup and utility adjustments (including a refined `pyany_to_table_provider`) are included.
DataFrame→TableProvider conversion, plus tests and FFI/pycapsule improvements.

-- Registration logic & API
* Refactor of table provider registration logic for improved clarity and simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces surprising side effects).
* Update table registration method to call `register_table` instead of `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more provider types can be registered uniformly.
* Improve error messages related to registration failures (missing PyCapsule name and DataFrame registration errors).

-- DataFrame ↔ TableProvider conversions
* Introduce utility functions to simplify table provider conversions and centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

-- FFI / PyCapsule handling
* Update `FFI_TableProvider` initialization to accept an optional parameter (improves FFI ergonomics).
* Introduce `table_provider_from_pycapsule` utility to standardize pycapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help debugging.

-- DeltaTable & specific integrations
* Update TableProvider registration for `DeltaTable` to use the correct registration method (matches the new API surface).

-- Tests, docs & minor fixes
* Add tests for registering a `TableProvider` from a `DataFrame` and from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another typo in the error message for unsupported volatility type.
* Simplify version retrieval by removing exception handling around `PackageNotFoundError` (streamlines code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility; review threading assumptions.
* Added type hints to `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.

Notes:
* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
utilities, docs, and robustness fixes

* Normalized table-provider handling and simplified registration flow across the codebase; multiple commits centralize provider coercion and normalization.
* Introduced utility helpers (`coerce_table_provider`, `extract_table_provider`, `_normalize_table_provider`) to centralize extraction, error handling, and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer implementations, direct returns of DataFrame views where appropriate, and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for invalid registrations; added handling for `DataFrameWrapper` by converting it to a view.
* Added tests, including a schema registration test using a PyArrow dataset and internal DataFrame tests to cover new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with parameter details, added usage examples for `into_view`, and documented deprecations (e.g., `register_table_provider` → `register_table`).
* Warning and UX fixes: synchronized deprecation `stacklevel` so warnings point to caller code; improved `__dir__` to return sorted, unique attributes.
* Cleanup: removed unused imports (including an unused error import from `utils.rs`) and other dead code to reduce noise.
…dating method calls
c47b0f1
to
ea2973c
Compare
ea2973c
to
1872a7f
Compare
…d avoid documentation duplication
This is an incredible start!
From a naming perspective I think it's more intuitive to just call these `Table` instead of `TableProvider`. I know we have a `Table` class in `datafusion.catalog`. It feels like this is a real opportunity to give the user an even more unified experience.
If we are going to be making big changes like this and deprecating some functions, then I really want to make sure we give an extremely pleasant end user experience.
dev/changelog/49.0.0.md
Outdated
**Deprecations:**

- Document that `SessionContext.register_table_provider` is deprecated in favor of `SessionContext.register_table`.
These changelogs are automatically generated, so I don't think we want to make changes here. Regardless, these would go into the 51.0.0 release.
I will revert this change.
docs/source/conf.py
Outdated
# Skip private members that start with underscore to avoid duplication
if name.split(".")[-1].startswith("_") and what in ("data", "variable"):
    skip = True
So I can understand better, why do we need both this rule and the one above in lines 86-88?
- The explicit `skip_contents` list handles targeted, known problem cases (re-exports, specific deprecated APIs, or particular items that cause duplication or confusion). It’s precise and intentional.
- The private-name filter is a broad rule to remove many small implementation details (module-level private constants) without listing them all manually. This prevents the docs from listing every private variable.
I'll also add clarifying comments in `autoapi_skip_member_fn`.
provider = TableProvider.from_capsule(delta_table.__datafusion_table_provider__())
ctx.register_table("my_delta_table", provider)
This feels like a worse experience than before. Why can we not just call `register_table("my_delta_table", delta_table)`?
capsule = provider.__datafusion_table_provider__()
capsule_provider = TableProvider.from_capsule(capsule)
df = ctx.from_pydict({"a": [1]})
view_provider = TableProvider.from_dataframe(df)
# or: view_provider = df.into_view()
ctx.register_table("capsule_table", capsule_provider)
ctx.register_table("view_table", view_provider)
ctx.table("capsule_table").show()
ctx.table("view_table").show()
This example takes a bit of cognitive load to understand what we're demonstrating.
First off, similar to my comments above, I don't think we want our users to have to think about whether they're using something that comes from a PyCapsule interface or not. Suppose I am a library user and I get a delta table object that implements PyCapsule. As a user of that library, I shouldn't have to understand how the interfacing works. I should just be able to use it directly. So I want to be able to just pass those objects directly to `TableProvider` or `register_table` without having to think about or understand these mechanics behind the scenes.
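One way to picture the requested behavior is a registration call that duck-types its argument. The sketch below uses stand-in classes only (`FakeDeltaTable`, `Table`, `Context`, and `normalize_table` are all hypothetical, and a plain string stands in for the PyCapsule); it is not the datafusion implementation.

```python
class FakeDeltaTable:
    """Stand-in for a third-party object that exports a provider capsule."""

    def __datafusion_table_provider__(self):
        # A real implementation would return a PyCapsule, not a string.
        return "capsule-for-delta"


class Table:
    """Stand-in wrapper; not the real datafusion class."""

    def __init__(self, source):
        self.source = source


def normalize_table(obj):
    """Coerce supported inputs into a Table so callers never touch capsules."""
    if isinstance(obj, Table):
        return obj
    if hasattr(obj, "__datafusion_table_provider__"):
        return Table(obj.__datafusion_table_provider__())
    raise TypeError(f"cannot register object of type {type(obj).__name__}")


class Context:
    """Stand-in session context that normalizes on registration."""

    def __init__(self):
        self.tables = {}

    def register_table(self, name, table):
        self.tables[name] = normalize_table(table)


ctx = Context()
# The library user passes the third-party object directly.
ctx.register_table("my_delta_table", FakeDeltaTable())
```

With the coercion inside `register_table`, the capsule mechanics stay an implementation detail of the binding layer.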
python/datafusion/__init__.py
Outdated
# isort: skip_file # Prevent import-sorting linter errors (I001)
# ruff: noqa: I001
Is this ruff lint causing a problem?
I'll remove them.
python/datafusion/dataframe.py
Outdated
This is the preferred way to obtain a view for
:py:meth:`~datafusion.context.SessionContext.register_table`.
I don't understand this statement.
Here are the reasons:
1. Direct API: Most efficient path - directly calls the underlying Rust `DataFrame.into_view()` method without intermediate delegations.
2. Clear semantics: The `into_` prefix follows Rust conventions, indicating conversion from one type to another.
3. Canonical method: Other approaches like `TableProvider.from_dataframe` delegate to this method internally, making this the single source of truth.
4. Deprecated alternatives: The older `TableProvider.from_view` helper is deprecated and issues warnings when used.
I will add the above to the comment in `def to_view` too.
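The delegation and deprecation pattern described in points 3 and 4 can be sketched as follows. The classes are stand-ins written for this illustration (including `View`, which is hypothetical), and the warning text is invented; only the `warnings.warn(..., DeprecationWarning, stacklevel=2)` call is standard-library behavior.

```python
import warnings


class View:
    """Stand-in for the object returned by ``into_view`` (hypothetical)."""

    def __init__(self, query):
        self.query = query


class DataFrame:
    """Stand-in for a DataFrame; only the conversion path is modelled."""

    def __init__(self, query):
        self.query = query

    def into_view(self):
        # Canonical conversion: every other helper funnels through here.
        return View(self.query)


class TableProvider:
    """Stand-in holding the helper constructors named in the discussion."""

    @staticmethod
    def from_dataframe(df):
        # Thin delegation to the canonical method.
        return df.into_view()

    @staticmethod
    def from_view(df):
        # Deprecated alias: warn, then delegate. stacklevel=2 attributes the
        # warning to the caller's line rather than to this helper.
        warnings.warn(
            "from_view is deprecated; use DataFrame.into_view() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return df.into_view()
```

Keeping a single canonical method means the deprecated alias can later be removed without touching the conversion logic.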
python/datafusion/dataframe.py
Outdated
>>> from datafusion import SessionContext
>>> ctx = SessionContext()
>>> df = ctx.sql("SELECT 1 AS value")
>>> provider = df.into_view()
From an end user's perspective, they turn a dataframe into a view, which they then register so they can use it later. I don't think this end user needs to understand the concept of TableProvider at all. In the example I would change the variable name `provider` to `view`.
This makes sense, given that we're moving away from 'provider'
…ge and advantages
…age of Table instead
245e89f
to
918b1ce
Compare
I removed the `TableProvider` class in Python.
`TableProvider` wrapper & unified `register_table` API; deprecate `register_table_provider`
Which issue does this PR close?
Rationale for this change
This change consolidates and modernizes table provider registration in DataFusion's Python bindings. Previously, there were multiple ad-hoc mechanisms (`register_table_provider`, `Table.from_view()`, direct `Table` or pycapsule usage) that led to confusing APIs, inconsistent behaviors, and fragmented documentation.

This PR introduces a clean, centralized approach using the high-level `Table` wrapper class and a normalization layer that supports multiple table provider inputs, including:

- `Table` objects
- `Dataset`s
- `DataFrame` views

By consolidating registration into `SessionContext.register_table()` and extending `Schema.register_table()` to match, we simplify the user experience, reduce internal complexity, and align the API more closely with Pythonic expectations.

What changes are included in this PR?
High-level Summary
- Introduces a new high-level Python API: `datafusion.Table`
  - `.from_capsule()`, `.from_dataframe()`, and `.from_dataset()`
- Deprecates `SessionContext.register_table_provider()` in favor of `register_table()`
- Deprecates `Table.from_view()` in favor of `DataFrame.into_view()` and `Table.from_dataframe()`
- Updates `Schema.register_table()` to support any object implementing `__datafusion_table_provider__` and `pyarrow.dataset.Dataset`
- Adds `_normalize_table_provider` utility to coerce supported input types
- Centralizes coercion logic in Rust with `coerce_table_provider` and `table_provider_from_pycapsule()`
- Enhances documentation and examples to reflect modern registration idioms
- Improves test coverage for new usage patterns and coercion logic
- Introduces `datafusion.EXPECTED_PROVIDER_MSG` for stable, testable error messages

Are these changes tested?
Yes. This PR includes comprehensive test coverage:
- Unit tests for new `Table` methods and error handling
- Integration tests verifying:
  - `Table.from_dataframe()`, `from_capsule()`, and `into_view()`
  - `pyarrow.dataset.Dataset` objects
  - `DeprecationWarning`
  - `SessionContext` and `Schema` registration paths behave identically
  - `__datafusion_table_provider__` can be used directly
  - `DataFrame` without conversion)

Are there any user-facing changes?
✅ Additions
- New public API: `datafusion.Table`
  - `Table.from_dataframe(df)`
  - `Table.from_capsule(capsule)`
  - `Table.from_dataset(dataset)`
- `DataFrame.into_view()`: recommended way to convert to a table provider
- `datafusion.EXPECTED_PROVIDER_MSG`: stable constant for validation errors
- `Schema.register_table(...)` now accepts all supported inputs (like `SessionContext.register_table`)
- `SessionContext.register_table_provider(...)` is deprecated in favor of `register_table`
- `Table.from_view()` is deprecated and emits a `DeprecationWarning`; use `into_view()` or `from_dataframe()` instead

📋 Documentation & Examples
- `Table` and `register_table`
- `DataFrame` objects

🔁 Compatibility
- Fully backwards compatible
- Existing table registration logic continues to work as expected
- Encourages migration to the new `Table` API for consistency and future-proofing

Breaking changes?

No. This is a non-breaking refactor that preserves all existing behaviors through shims and deprecation paths. However, users relying on internal or undocumented APIs (e.g., raw table objects or bypassing coercion) may encounter changes.
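To illustrate why a shared message constant makes validation errors testable, here is a minimal sketch. The wording assigned to `EXPECTED_PROVIDER_MSG` is a placeholder (the real text is not shown in this PR description), and `coerce_table` and `rejects_with_expected_message` are hypothetical helpers written for this example.

```python
# Placeholder wording; the actual constant's text is an assumption here.
EXPECTED_PROVIDER_MSG = (
    "Expected a Table or an object exporting __datafusion_table_provider__"
)


def coerce_table(obj):
    """Accept capsule-exporting objects and reject everything else (sketch)."""
    if hasattr(obj, "__datafusion_table_provider__"):
        return obj
    # Raising with the shared constant keeps the message in one place.
    raise TypeError(EXPECTED_PROVIDER_MSG)


def rejects_with_expected_message(value):
    """Return True when ``value`` is rejected with the shared constant."""
    try:
        coerce_table(value)
    except TypeError as exc:
        # Tests compare against the constant instead of a copied string.
        return str(exc) == EXPECTED_PROVIDER_MSG
    return False
```

Because tests reference the constant rather than a literal, rewording the message later does not break them.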